Target of this Project

In this project, I will analyze the Red Wine data and try to understand which variables are most closely associated with the quality of the wine.

Data set

The data can be downloaded from this link.

Also read this text file for guidance on creating effective plots.

The data set contains 11 chemical characteristics plus a quality score from 0 to 10, based on ratings from at least 3 wine experts, for 1599 different wines.

Exploring data

The data has 1599 observations of 13 variables. The type of data in each column is as follows:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The units of each column are:

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  12. quality (score between 0 and 10)

Summary statistics for each variable, followed by the number of wines at each quality score:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
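
The listings above are the kind of output produced by `str()`, `summary()` and `table()`. The following is a minimal sketch of those calls, not the exact code behind this report. The file name `wineQualityReds.csv` is an assumption (the report does not state it), so to stay self-contained the example rebuilds only the first ten observations of three columns from the `str()` listing.

```r
# Assumption: with the real file, the data would be loaded with something like
#   wine <- read.csv("wineQualityReds.csv")   # hypothetical file name
# Here we rebuild the first 10 rows of three columns from the str() output.
wine <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine)                              # column types and first values
summary(wine)                          # min, quartiles, mean and max per column
quality_counts <- table(wine$quality)  # number of wines per quality score
print(quality_counts)
```

On the full data set the same three calls would give the structure, the summary table and the quality counts shown in this section.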

Univariate Plots

Let's look more closely at each variable on its own. The density plots show how each variable is distributed; only a few of them are close to normal.

Fixed acidity is the non-volatile acid present in wine. Wines in this data set have an average fixed acidity of 8.32 g/dm^3. The histogram shows that the distribution is slightly positively skewed, indicating the presence of a few outliers with high fixed acidity.

Volatile acidity is the amount of acetic acid in the wine, which at too high a level can lead to an unpleasant, vinegar taste. Wines in this data set have an average volatile acidity of 0.53 g/dm^3 (median 0.52). Like fixed acidity, the distribution is positively skewed, with a few wines having high volatile acidity (outliers). We suspect that these wines might be of low quality.

Citric acid is found in small quantities. It can add freshness and flavor to the wine. Fewer wines have high levels of citric acid. On average, wines in this data set have 0.27 g/dm^3 of citric acid.

Residual sugar, the amount of sugar that remains after fermentation stops, has a heavily skewed, long-tailed distribution with many outliers.

Chlorides, the amount of salt in the wine, also have a heavily skewed distribution with many outliers, similar to residual sugar. On average, wines in this dataset contain about 0.09 g/dm^3 of chlorides (median 0.08). The plot shows outliers as high as 0.6 g/dm^3.

Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. More wines in the dataset have low levels of free sulfur dioxide than high levels. On average, wines contain 15.87 mg/dm^3 of free sulfur dioxide.

Total sulfur dioxide is the amount of free and bound forms of sulfur dioxide. Similar to free sulfur dioxide, its distribution is positively skewed, with a few wines having extreme values. The box plot shows two large outliers.

The density of the wine is one of the few normally distributed variables in this dataset. The median and mean are roughly the same (about 0.997 g/cm^3).

pH describes how acidic or basic the wine is on a scale of 0 (very acidic) to 14 (very basic). Most wines fall in the 3-4 range.

Sulphates refer to additives that can contribute to sulfur dioxide in the wine. The distribution of sulphates is positively skewed with a few outliers. The average amount of sulphates is 0.66 g/dm^3.

Few wines have a high alcohol content. The average alcohol content is around 10.4%.
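
One quick way to cross-check the direction of skew without the plots is to compare each mean with its median. A mean clearly above the median hints at a right (positive) skew. The sketch below uses the values from the `summary()` output. The 1% threshold is an arbitrary assumption chosen for illustration, and the heuristic is only a rough guide, not a formal skewness test.

```r
# Mean and median copied from the summary() output earlier in this report.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "residual.sugar",
               "chlorides", "total.sulfur.dioxide", "density"),
  mean     = c(8.32, 0.5278, 2.539, 0.08747, 46.47, 0.9967),
  median   = c(7.90, 0.5200, 2.200, 0.07900, 38.00, 0.9968)
)

# Relative gap between mean and median. A clearly positive gap hints at a
# right (positive) skew; a gap near zero hints at a roughly symmetric shape.
# Assumption: the 1% cut-off is arbitrary and only for illustration.
stats$rel_gap <- (stats$mean - stats$median) / stats$median
stats$hint <- ifelse(stats$rel_gap > 0.01, "right-skewed",
              ifelse(stats$rel_gap < -0.01, "left-skewed", "roughly symmetric"))

print(stats[order(-stats$rel_gap), ])
```

Under this heuristic, density comes out as roughly symmetric while the other listed variables lean to the right.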

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

One thing I notice from the plot above is that most of the wines in the dataset score 5 or 6, which makes me wonder whether the data is accurate and complete. Was it collected from a specific geographical location, or was it spread over a large area? Because the good and poor quality wines are almost outliers here, it might be difficult to build an accurate model of wine quality. Let’s look at the other plots.

Univariate Analysis

Let's focus on quality. Although quality is supposed to range from 0 to 10, all records fall between 3 and 8. I therefore grouped the quality scores into three levels: Bad (3-4), Average (5-6) and Good (7-8).

## 
##     Bad Average    Good 
##      63    1319     217
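
The grouping can be sketched with `cut()`. The cut points below are an assumption (the report does not list them), but they reproduce the Bad/Average/Good counts shown above when applied to the quality counts from the earlier `table()` output.

```r
# Rebuild the quality column from the counts printed by table() earlier.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Assumed cut points: (0,4] -> Bad, (4,6] -> Average, (6,10] -> Good.
quality_level <- cut(quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("Bad", "Average", "Good"))

level_counts <- table(quality_level)
print(level_counts)
print(round(100 * prop.table(level_counts), 1))  # share of each level in percent
```

`cut()` uses right-closed intervals by default, so a score of 4 falls in Bad and a score of 6 in Average.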

82.5% of the wines (1319 of 1599) are of average quality.

What is the structure of your dataset?

The Red Wine Dataset had 1599 rows and 13 columns originally. After I added a new column called ‘quality_level’, the number of columns became 14. Here our categorical variable is ‘quality’, and the rest of the variables are numerical variables which reflect the physical and chemical properties of the wine.

I also see that in this dataset, most of the wines belong to the ‘average’ quality, with very few ‘bad’ and ‘good’ ones. This again raises my doubt about whether the dataset is complete. With so few of these wines, it might be challenging to build a predictive model, as I don’t have enough data for the Good Quality and the Bad Quality wines.

What is/are the main feature(s) of interest in your dataset?

My main point of interest in this dataset is the ‘quality’. I would like to determine which factors determine the quality of a wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Without analyzing the data, I think the acidity (fixed, volatile or citric) may change the quality of the wine. pH, being related to acidity, may also have some effect on the quality. It would also be interesting to see how the pH is affected by the different acids present in the wine, and whether the overall pH affects the quality. I also think the residual sugar will have an effect on the wine quality, as sugar determines how sweet the wine is and may adversely affect its taste.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new variable “bound.sulfur.dioxide” to split total sulfur dioxide into two parts, the free part and the bound part, so that I can investigate them separately in the following explorations. I also created ‘quality_level’ by grouping the quality scores into three levels: Bad, Average and Good.
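
A minimal sketch of how the bound part could be derived, assuming it is simply total minus free sulfur dioxide (the report describes the split but does not show the formula). The values are the first ten observations from the `str()` listing.

```r
# First 10 observations, copied from the str() output earlier in this report.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Assumption: bound SO2 is whatever part of the total is not free.
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

# Sanity check: the free part should never exceed the total.
stopifnot(all(wine$bound.sulfur.dioxide >= 0))
print(wine)
```

Because the new column is a linear combination of two existing ones, it is expected to correlate strongly with total sulfur dioxide.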

Bivariate Plots Section

Let’s start with the correlation matrix for all variables:

## 
## ---------------------------------------------------------------------------
##                        fixed.acidity   volatile.acidity   citric.acid 
## -------------------------- --------------- ------------------ -------------
##     **fixed.acidity**             1             -0.2561        **0.6717**  
## 
##    **volatile.acidity**        -0.2561             1           **-0.5525** 
## 
##      **citric.acid**         **0.6717**       **-0.5525**           1      
## 
##     **residual.sugar**         0.1148           0.001918         0.1436    
## 
##       **chlorides**            0.09371           0.0613          0.2038    
## 
##  **free.sulfur.dioxide**       -0.1538          -0.0105         -0.06098   
## 
##  **total.sulfur.dioxide**      -0.1132          0.07647          0.03553   
## 
##        **density**            **0.668**         0.02203        **0.3649**  
## 
##           **pH**             **-0.683**          0.2349        **-0.5419** 
## 
##       **sulphates**             0.183            -0.261        **0.3128**  
## 
##        **alcohol**            -0.06167          -0.2023          0.1099    
## 
##        **quality**             0.1241         **-0.3906**        0.2264    
## 
##  **bound.sulfur.dioxide**     -0.07815          0.09703          0.06678   
## ---------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## ------------------------------------------------------------------------------
##                        residual.sugar   chlorides    free.sulfur.dioxide 
## -------------------------- ---------------- ------------ ---------------------
##     **fixed.acidity**           0.1148        0.09371           -0.1538       
## 
##    **volatile.acidity**        0.001918        0.0613           -0.0105       
## 
##      **citric.acid**            0.1436         0.2038          -0.06098       
## 
##     **residual.sugar**            1           0.05561            0.187        
## 
##       **chlorides**            0.05561           1             0.005562       
## 
##  **free.sulfur.dioxide**        0.187         0.005562             1          
## 
##  **total.sulfur.dioxide**       0.203          0.0474         **0.6677**      
## 
##        **density**            **0.3553**       0.2006          -0.02195       
## 
##           **pH**               -0.08565        -0.265           0.07038       
## 
##       **sulphates**            0.005527      **0.3713**         0.05166       
## 
##        **alcohol**             0.04208        -0.2211          -0.06941       
## 
##        **quality**             0.01373        -0.1289          -0.05066       
## 
##  **bound.sulfur.dioxide**       0.1745        0.05548         **0.4251**      
## ------------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------------------------------------
##                        total.sulfur.dioxide     density         pH      
## -------------------------- ---------------------- ------------- -------------
##     **fixed.acidity**             -0.1132           **0.668**    **-0.683**  
## 
##    **volatile.acidity**           0.07647            0.02203       0.2349    
## 
##      **citric.acid**              0.03553          **0.3649**    **-0.5419** 
## 
##     **residual.sugar**             0.203           **0.3553**     -0.08565   
## 
##       **chlorides**                0.0474            0.2006        -0.265    
## 
##  **free.sulfur.dioxide**         **0.6677**         -0.02195       0.07038   
## 
##  **total.sulfur.dioxide**            1               0.07127      -0.06649   
## 
##        **density**                0.07127               1        **-0.3417** 
## 
##           **pH**                  -0.06649         **-0.3417**        1      
## 
##       **sulphates**               0.04295            0.1485        -0.1966   
## 
##        **alcohol**                -0.2057          **-0.4962**     0.2056    
## 
##        **quality**                -0.1851            -0.1749      -0.05773   
## 
##  **bound.sulfur.dioxide**        **0.9577**          0.09513       -0.1081   
## -----------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -------------------------------------------------------------------
##                        sulphates      alcohol       quality   
## -------------------------- ------------ ------------- -------------
##     **fixed.acidity**         0.183       -0.06167       0.1241    
## 
##    **volatile.acidity**       -0.261       -0.2023     **-0.3906** 
## 
##      **citric.acid**        **0.3128**     0.1099        0.2264    
## 
##     **residual.sugar**       0.005527      0.04208       0.01373   
## 
##       **chlorides**         **0.3713**     -0.2211       -0.1289   
## 
##  **free.sulfur.dioxide**     0.05166      -0.06941      -0.05066   
## 
##  **total.sulfur.dioxide**    0.04295       -0.2057       -0.1851   
## 
##        **density**            0.1485     **-0.4962**     -0.1749   
## 
##           **pH**             -0.1966       0.2056       -0.05773   
## 
##       **sulphates**             1          0.09359       0.2514    
## 
##        **alcohol**           0.09359          1        **0.4762**  
## 
##        **quality**            0.2514     **0.4762**         1      
## 
##  **bound.sulfur.dioxide**    0.03224       -0.2232       -0.2055   
## -------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -------------------------------------------------
##                        bound.sulfur.dioxide 
## -------------------------- ----------------------
##     **fixed.acidity**             -0.07815       
## 
##    **volatile.acidity**           0.09703        
## 
##      **citric.acid**              0.06678        
## 
##     **residual.sugar**             0.1745        
## 
##       **chlorides**               0.05548        
## 
##  **free.sulfur.dioxide**         **0.4251**      
## 
##  **total.sulfur.dioxide**        **0.9577**      
## 
##        **density**                0.09513        
## 
##           **pH**                  -0.1081        
## 
##       **sulphates**               0.03224        
## 
##        **alcohol**                -0.2232        
## 
##        **quality**                -0.2055        
## 
##  **bound.sulfur.dioxide**            1           
## -------------------------------------------------

Let’s zoom into the correlation between quality and the chemical characteristics:

variable                Pearson corr
fixed.acidity                   0.12
volatile.acidity               -0.39
citric.acid                     0.23
residual.sugar                  0.01
chlorides                      -0.13
free.sulfur.dioxide            -0.05
total.sulfur.dioxide           -0.19
density                        -0.17
pH                             -0.06
sulphates                       0.25
alcohol                         0.48
bound.sulfur.dioxide           -0.2
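
A table like the one above can be computed by correlating every predictor with quality. The sketch below uses only the first ten observations from the `str()` listing, so its numbers will not match the full-data values. It illustrates the technique and is not necessarily the code used for this report.

```r
# First 10 observations of a few columns, copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of each predictor with quality.
predictors  <- setdiff(names(wine), "quality")
quality_cor <- sapply(wine[predictors],
                      function(x) cor(x, wine$quality, method = "pearson"))

print(round(sort(quality_cor, decreasing = TRUE), 2))
```

Calling `cor(wine)` on the whole data frame would instead return the full matrix of pairwise correlations.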

As we can see, the only relatively strong correlation is with the alcohol percentage (0.48), followed by a negative correlation with volatile acidity (-0.39).

Another way to see the relationships is to draw boxplots. The following graphs show boxplots of each chemical property at each quality level [3-8].

The two magenta lines represent the 10% and 90% quantiles. The red line represents the median [50%]. The black points inside the boxplots, and the line connecting them, represent the mean for each quality level.
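
The numbers behind such a plot can be computed directly. The sketch below derives the per-level mean and median and the 10% and 90% quantiles for alcohol, using the first ten observations from the `str()` listing. Treating the magenta lines as quantiles of the whole variable, rather than per quality level, is an assumption.

```r
# First 10 observations, copied from the str() output earlier in this report.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Mean and median of alcohol within each quality level.
level_mean   <- tapply(wine$alcohol, wine$quality, mean)
level_median <- tapply(wine$alcohol, wine$quality, median)

# Assumption: the reference lines are quantiles of the variable overall.
overall_q <- quantile(wine$alcohol, probs = c(0.10, 0.90))

print(data.frame(quality = names(level_mean),
                 mean    = as.vector(level_mean),
                 median  = as.vector(level_median)))
print(overall_q)
```

The same calls can be repeated for each chemical property to tabulate how its mean moves across quality levels.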

Fixed acidity: the mean increases from level 4 to 7.

Volatile acidity: the mean decreases from level 3 to 7, and increases a little at 8.

Citric acid: the mean stays the same from 3 to 4, then increases up to 7, and stays the same at 8.

Residual sugar: the mean decreases slightly from 3 to 8.

Chlorides: the mean decreases significantly from 3 to 4, then decreases slowly all the way to 8.

Free sulfur dioxide: the mean increases from 3 to 5, then decreases from 5 to 8.

Total sulfur dioxide: as with free sulfur dioxide, the mean increases from 3 to 5, then decreases from 5 to 8.

Bound sulfur dioxide: this is the new variable I added. The mean increases from 3 to 5, then decreases from 5 to 8. Because this variable is derived from total sulfur dioxide, its mean changes in the same way as free and total sulfur dioxide, so it is difficult to conclude that it is a separate factor affecting the quality of wine.

Density: the mean decreases from 3 to 4 and from 5 to 8, but increases from 4 to 5.

pH: the mean stays the same from 3 to 4 and from 5 to 6, and decreases otherwise.

Sulphates: the mean increases slowly all the way.

Alcohol: the mean increases significantly from 5 to 8, and from 3 to 4, but decreases from 4 to 5.

So why are we doing this? Let's remember what we are looking for: relations between quality and the chemical properties. The correlations showed a relation with alcohol but little with the other variables. The boxplots, however, show many increases and decreases across quality levels, and they show that even the relation between quality and alcohol isn’t perfectly positive.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  1. Fixed Acidity seems to have almost no effect on quality.
  2. Volatile Acidity seems to have a negative correlation with the quality.
  3. Better wines seem to have higher concentration of Citric Acid.
  4. Better wines seem to have higher alcohol percentages.
  5. Although the correlation is weak, a lower concentration of chlorides seems to go with better quality wines.
  6. Better wines seem to have lower densities. But then again, this may be due to the higher alcohol content in them.
  7. Better wines seem to be more acidic.
  8. Residual sugar has almost no effect on the wine quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Volatile acidity had a positive correlation with pH which at first was totally unexpected to me.

What was the strongest relationship you found?

Alcohol is the first thing that comes to mind when I think of wine, so I wanted to see the relationship between the quality of wine and its alcohol content. Red wines of higher quality seem to have a higher alcohol content. The relationship is not perfectly linear, because most of the wines are of medium quality (5, 6) and the alcohol content at quality 6 is more spread out than at quality 5. However, there are more instances of high-alcohol wine rated at a higher quality. The correlation is not particularly high (0.48). However, we can clearly see from the boxplots that the average alcohol content rises as we go from mid to top quality wines.

Multivariate Plots Section

Alcohol vs volatile.acidity: in the multivariate plots, the good quality wines have lower volatile.acidity.

Adding the sulphates variable to the analysis suggests that wines with high sulphates tend to be of good quality.

Across the different quality levels, free sulfur dioxide and total sulfur dioxide are positively related, but sulphates show no relation to either free or total sulfur dioxide.

Quality is positively correlated with alcohol, although a few points fall well above and below the fitted line. Alcohol is negatively correlated with density.

Building a Linear Regression Model

Having explored the relations between quality and the chemical properties, let's build a regression model so that in the future, given the chemical properties of a wine, we can predict its quality.

Now let's look at the models:

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity, 
##     data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH, 
##     data = training_data)
## 
## ====================================================================================================
##                          m1            m2           m3           m4           m5           m6       
## ----------------------------------------------------------------------------------------------------
##   (Intercept)           2.155***      1.727***     2.866***     2.973***     2.497***     3.494***  
##                        (0.220)       (0.224)      (0.247)      (0.254)      (0.287)      (0.515)    
##   alcohol               0.333***      0.320***     0.286***     0.284***     0.296***     0.339***  
##                        (0.021)       (0.021)      (0.020)      (0.020)      (0.020)      (0.021)    
##   sulphates                           0.855***     0.599***     0.650***     0.667***     0.733***  
##                                      (0.126)      (0.124)      (0.127)      (0.126)      (0.129)    
##   volatile.acidity                                -1.153***    -1.279***    -1.352***               
##                                                   (0.124)      (0.143)      (0.144)                 
##   citric.acid                                                  -0.231       -0.629***               
##                                                                (0.132)      (0.174)                 
##   fixed.acidity                                                              0.058***               
##                                                                             (0.017)                 
##   pH                                                                                     -0.569***  
##                                                                                          (0.149)    
## ----------------------------------------------------------------------------------------------------
##   R-squared             0.209         0.245        0.308        0.310        0.319        0.256     
##   adj. R-squared        0.208         0.243        0.306        0.307        0.315        0.254     
##   sigma                 0.707         0.691        0.662        0.661        0.657        0.686     
##   F                   252.335       155.125      141.769      107.317       89.264      109.700     
##   p                     0.000         0.000        0.000        0.000        0.000        0.000     
##   Log-likelihood    -1027.549     -1004.996     -963.139     -961.610     -955.575     -997.782     
##   Deviance            478.652       456.660      418.487      417.154      411.937      449.841     
##   AIC                2061.098      2017.992     1936.279     1935.219     1925.150     2005.565     
##   BIC                2075.695      2037.456     1960.608     1964.415     1959.211     2029.894     
##   N                   959           959          959          959          959          959         
## ====================================================================================================
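
A comparison like the one above can be sketched as follows. The data here is SYNTHETIC (randomly generated stand-in values, not the wine measurements), and the 60/40 split is an assumption inferred from N = 959 out of 1599 rows. Only the names `training_data`, `m1` and `m2` come from the output above; `test_data`, the seed and the coefficients used to simulate quality are hypothetical.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# SYNTHETIC stand-in data (not the real wine measurements), same number of rows.
n <- 1599
wine <- data.frame(
  alcohol   = runif(n, min = 8.4, max = 14.9),
  sulphates = runif(n, min = 0.33, max = 2.0)
)
wine$quality <- 2 + 0.3 * wine$alcohol + 0.7 * wine$sulphates + rnorm(n, sd = 0.7)

# Assumed 60/40 split: floor(0.6 * 1599) = 959, matching N in the table.
train_idx     <- sample(seq_len(n), size = floor(0.6 * n))
training_data <- wine[train_idx, ]
test_data     <- wine[-train_idx, ]

# Nested models, adding one predictor at a time as in m1 and m2.
m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- lm(quality ~ alcohol + sulphates, data = training_data)

r_squared <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r_squared, 3))
print(AIC(m1, m2))  # lower AIC indicates a better trade-off of fit and size

# Root mean squared error of m2 on the held-out rows.
test_rmse <- sqrt(mean((test_data$quality - predict(m2, newdata = test_data))^2))
print(test_rmse)
```

One detail worth checking in the original calls: if `quality` had been converted to a factor, `as.numeric(quality)` would return the level codes (1 to 6) rather than the scores (3 to 8), and `as.numeric(as.character(quality))` would recover the scores.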

Analysis of the Multivariate Plots

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  1. High alcohol and sulphate content seem to produce better wines.
  2. Density, even though weakly correlated, plays a part in improving the wine quality.

Were there any interesting or surprising interactions between features?

Through the multivariate analysis, I was surprised to find that volatile acidity, sulphates and alcohol together play a part in wine quality. Volatile acidity relates to sourness and sulphates to saltiness, and these were the biggest factors in the taste of the wines.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created several linear models. The main problem was that there was not enough data to place much confidence in the fitted equations. The R-squared values are low: alcohol alone explains only about 21% of the variance in wine quality (m1, R-squared 0.209), and most of the predictions converged on the Average quality wines. This may be because our dataset consists mainly of ‘Average’ quality wines; with very few ‘Good’ and ‘Bad’ quality wines in the training dataset, it was difficult to predict the edge cases. A more complete dataset might have helped in predicting the higher range values.
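
For a regression with a single predictor, R-squared equals the squared Pearson correlation between predictor and response. The sketch below applies that to the full-data correlation of 0.4762 between alcohol and quality, which gives about 0.227, slightly above the 0.209 that m1 reports on the 959 training rows. It then checks the identity on the first ten observations from the `str()` listing.

```r
# Full-data Pearson correlation between alcohol and quality, taken from the
# correlation table earlier in this report.
r_full  <- 0.4762
r2_full <- r_full^2
print(round(r2_full, 3))

# Check the identity R^2 = r^2 on the first 10 observations from str().
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

fit         <- lm(quality ~ alcohol)
r2_from_lm  <- summary(fit)$r.squared
r2_from_cor <- cor(alcohol, quality)^2

print(c(lm = r2_from_lm, cor = r2_from_cor))
```

The identity holds only for a single predictor with an intercept; with several predictors, R-squared is the squared correlation between the fitted and observed values instead.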

Final Plots and Summary

Plot 1

This plot tells us that alcohol percentage plays a big role in determining the quality of wines: the higher the alcohol percentage, the better the wine quality tends to be. Even though most of the data pertains to average quality wine, the plot shows that the mean and median nearly coincide in every box, suggesting that alcohol is roughly symmetrically distributed within each quality level. So the very high median in the best quality wines implies that most of them have a high percentage of alcohol. However, the R-squared value of our linear model showed that alcohol alone explains only about 21% of the variance in wine quality. So alcohol is not the only factor responsible for higher wine quality.

Plot 2

In this plot, we see that the best quality wines have high values for both alcohol percentage and sulphate concentration, implying that high alcohol content and high sulphate concentration together seem to produce better wines. There is, however, a very slight downward slope, maybe because in the best quality wines the percentage of alcohol is slightly greater than the concentration of sulphates.

Plot 3

We see that the errors are much more densely concentrated in the ‘Average’ quality section than among the ‘Good’ and the ‘Bad’ quality wines. This reflects the fact that most of our dataset consists of ‘Average’ quality wines, with little data in the extreme ranges. The best linear model, m5, explains only about 32% of the variance in quality (R-squared 0.319). The earlier models also clearly show that, for lack of information, a linear model is not well suited to predicting both ‘Good’ and ‘Bad’ quality wines.

Reflections

This dataset is about red wines and contains 1599 observations of 13 variables. None of the observations contain NAs, but the dataset lacks categorical variables. So I created two new variables: a categorical one called quality_level, and bound sulfur dioxide, obtained by subtracting free sulfur dioxide from total sulfur dioxide.

I began my exploration by investigating individual variables, trying to figure out their distributions with histograms and counting the number of wines at each quality level.

Then I created a correlation matrix and a scatterplot matrix to see whether there are correlations between the variables. I was surprised at first that there are no strong correlations between quality and the chemical properties; the correlation with bound sulfur dioxide is moderate at best. So I investigated some related variables across quality levels. Then I explored the relationships between the categorical variables with a mosaic plot, and found that alcohol correlates with quality level to some extent. This is an important clue for further exploration.

One limitation of the dataset is its small size: it has only 1599 observations. With a much larger dataset we might find more interesting things or stronger correlations. As the number of variables grows, it also becomes difficult to find the underlying relationships through exploratory analysis alone, so we may need more advanced techniques such as machine learning (even deep learning). Future work therefore includes collecting more data and more variables, finding another dataset about white wines and doing a joint analysis, or applying machine learning techniques to uncover deeply hidden patterns.